IDENTIFICATION OF NUCLEOTIDES, AMINO ACIDS, OR CARBOHYDRATES BY MASS
SPECTROMETRY



Government support: 
Certain aspects of this invention were made with 
partial support under grant 8809710 from the National Science 
Foundation and grant R01GM52095 from the National Institutes 
of Health. The U.S. Government may have certain rights in 
this invention. 

Related Application 
The present application is a continuation-in-part 
of U.S. Serial No. 08/212,433, filed March 14, 1994, which is 
incorporated herein by reference. 

Background Of The Invention 

A number of approaches have been used in the past 
for applying the analytic power of mass spectrometry to 
peptides. Tandem mass spectrometry (MS/MS) techniques have 
been particularly useful. In tandem mass spectrometry, the 
peptide or other input (commonly obtained from a 
chromatography device) is applied to a first mass spectrometer 
which serves to select, from a mixture of peptides, a target 
peptide of a particular mass. The target peptide is then 
activated or fragmented to produce a mixture of the "target" 
or parent peptide and various component fragments, typically 
peptides of smaller mass. This mixture is then transmitted to 
a second mass spectrometer which records a fragment spectrum. 
This fragment spectrum will typically be expressed in the form 
of a bar graph having a plurality of peaks, each peak 
indicating the mass-to-charge ratio (m/z) of a detected
fragment and having an intensity value. 

Although the bare fragment spectrum can be of some 
interest, it is often desired to use the fragment spectrum to
identify the peptide (or the parent protein) which resulted in 
the fragment mixture. Previous approaches have typically 
involved using the fragment spectrum as a basis for 
hypothesizing one or more candidate amino acid sequences. 
This procedure has typically involved human analysis by a 
skilled researcher, although at least one automated procedure 
has been described. John Yates, III, et al., "Computer Aided
Interpretation of Low Energy MS/MS Mass Spectra of Peptides,"
Techniques In Protein Chemistry II (1991), pp. 477-485,
incorporated herein by reference. The candidate sequences can
then be compared with known amino acid sequences of various 
proteins in the protein sequence libraries. 

The procedure which involves hypothesizing 
candidate amino acid sequences based on fragment spectra is 
useful in a number of contexts but also has certain 
difficulties. Interpretation of the fragment spectra so as to 
produce candidate amino acid sequences is time-consuming, 
often inaccurate, highly technical and in general can be 
performed only by a few laboratories with extensive experience 
in tandem mass spectrometry. Reliance on human interpretation 
often means that analysis is relatively slow and lacks strict 
objectivity. Approaches based on peptide mass mapping are 
limited to peptide masses derived from an intact homogeneous
protein generated by specific and known proteolytic cleavage 
and thus are not generally applicable to mixtures of proteins. 

Accordingly, it would be useful to provide a system 
for correlating fragment spectra with known protein sequences 
while avoiding the delay and/or subjectivity involved in 
hypothesizing or deducing candidate amino acid sequences from 
the fragment spectra. 

Summary Of The Invention
According to the present invention, known amino
acid sequences, e.g., in a protein sequence library, are used
to calculate or predict one or more candidate fragment
spectra. The predicted fragment spectra are then compared
with an experimentally-derived fragment spectrum to determine
the best match or matches. Preferably, the parent peptide
from which the fragment spectrum was derived has a known mass.
Sub-sequences of the various sequences in the protein
sequence library are analyzed to identify those sub-sequences 
corresponding to a peptide whose mass is equal to (or within a 
given tolerance of) the mass of the parent peptide in the 
fragment spectrum. For each sub-sequence having the proper 
mass, a predicted fragment spectrum can be calculated, e.g., 
by calculating masses of various amino acid subsets of the 
candidate peptide. The result will be a plurality of 
candidate peptides, each with a predicted fragment spectrum. 
The predicted fragment spectra can then be compared with the 
fragment spectrum derived from the tandem mass spectrometer to 
identify one or more proteins having sub-sequences which are 
likely to be identical with the sequence of the peptide which 
resulted in the experimentally-derived fragment spectrum. 

Brief Description Of The Drawings 
Fig. 1 is a block diagram depicting previous
methods for correlating tandem mass spectrometer data with
sequences from a protein sequence library;

Fig. 2 is a block diagram showing a method for
correlating tandem mass spectrometer data with sequences from
a protein sequence library according to an embodiment of the
present invention;

Fig. 3 is a flow chart showing steps for 
correlating tandem mass spectrometry data with amino acid 
sequences, according to an embodiment of the present 
invention; 

Fig. 4 is a flow diagram showing details of a
method for the step of identifying candidate sub-sequences of 
Fig. 3; 

Fig. 5 is a fragment mass spectrum for a peptide of 
a type that can be used in connection with the present 
invention; and 

Figs. 6A-6E are flow charts showing an analysis
method, according to an embodiment of the present invention. 

Description Of The Specific Embodiments 
Before describing the embodiments of the present 
invention, it will be useful to describe, in greater detail, a 
previous method. As depicted in Fig. 1, the previous method 
is used for analysis of an unknown peptide 12. Typically the 
peptide will be output from a chromatography column which has 
been used to separate a partially fractionated protein. The 
protein can be fractionated by, for example, gel filtration 
chromatography and/or high performance liquid chromatography 
(HPLC). The sample 12 is introduced to a tandem mass
spectrometer 14 through an ionization method such as
electrospray ionization (ES). In the first mass spectrometer,
a peptide ion is selected, so that a targeted component of a
specific mass is separated from the rest of the sample 14a.
The targeted component is then activated or decomposed. In 
the case of a peptide, the result will be a mixture of the 
ionized parent peptide ("precursor ion") and component 
peptides of lower mass which are ionized to various states. A 
number of activation methods can be used including collisions 
with neutral gases (also referred to as collision induced 
dissociation). The parent peptide and its fragments are then
provided to the second mass spectrometer 14c, which outputs an 
intensity and m/z for each of the plurality of fragments in 
the fragment mixture. This information can be output as a 
fragment mass spectrum 16. Fig. 5 provides an example of such
a spectrum 16. In the spectrum 16 each fragment ion is
represented as a bar whose abscissa value indicates the
mass-to-charge ratio (m/z) and whose ordinate value represents
intensity. According to previous methods, in order to 
correlate a fragment spectrum with sequences from a protein 
sequence library, a fragment spectrum was converted into one
or more amino acid sequences judged to correspond to the 
fragment spectrum. In one strategy, the weight of each of the 
amino acids is subtracted from the molecular weight of the 
parent ion to determine what might be the molecular weight of 
a fragment assuming, respectively, each amino acid is in the 
terminal position. It is determined if this fragment mass is 
found in the actual measured fragment spectrum. Scores are 
generated for each of the amino acids and the scores are 
sorted to generate a list of partial sequences for the next 
subtraction cycle. Cycles continue until subtraction of the 
mass of an amino acid leaves a difference of less than 0.5 and 
greater than -0.5. The result is one or more candidate amino 
acid sequences 18. This procedure can be automated as 
described, for example, in Yates III (1991) supra. One or
more of the highest-scoring candidate sequences can then be
compared 21 to sequences in a protein sequence library 20 to
try to identify a protein having a sub-sequence similar or 
identical to the sequence believed to correspond to the 
peptide which generated the fragment spectrum 16. 

Fig. 2 shows an overview of a process according to 
the present invention. According to the process of Fig. 2, a 
fragment spectrum 16 is obtained in a manner similar to that 
described above for the fragment spectrum depicted in Fig. 1. 
Specifically, the sample 12 is provided to a tandem mass 
spectrometer 14. Procedures described below use a two-step
process to acquire MS/MS data. However, the present invention
can also be used in connection with mass spectrometry
approaches currently being developed which incorporate
acquisition of MS/MS data with a single step. In one
embodiment MS/MS spectra would be acquired at each mass. The
first MS would separate the ions by mass-to-charge and the
second would record the MS/MS spectrum. The second stage of
MS/MS would acquire, e.g., 5 to 10 spectra at each mass
transformed by the first MS.

A number of mass spectrometers can be used
including a triple-quadrupole mass spectrometer, a
Fourier-transform cyclotron resonance mass spectrometer, a tandem
time-of-flight mass spectrometer and a quadrupole ion trap
mass spectrometer. In the process of Fig. 2, however, it is
not necessary to use the fragment spectrum as a basis for 
hypothesizing one or more amino acid sequences. In the 
process of Fig. 2, sub-sequences contained in the protein 
sequence library 20 are used as a basis for predicting a 
plurality of mass spectra 22, e.g., using prediction 
techniques described more fully below. 

A number of sequence libraries can be used,
including, for example, the Genpept database, the GenBank
database (described in Burks, et al., "GenBank: Current status
and future directions," Methods in Enzymology, 183:3
(1990)), EMBL data library (described in Kahn, et al., "EMBL
Data Library," Methods in Enzymology, 183:23 (1990)), the
Protein Sequence Database (described in Barker, et al.,
"Protein Sequence Database," Methods in Enzymology, 183:31
(1990)), SWISS-PROT (described in Bairoch, et al., "The
SWISS-PROT protein sequence data bank, recent developments," Nucleic
Acids Res., 21:3093-3096 (1993)), and PIR-International
(described in "Index of the Protein Sequence Database of the
International Association of Protein Sequence Databanks
(PIR-International)," Protein Seq Data Anal. 5:67-192 (1993)).

The predicted mass spectra 22 are compared 24 to 
the experimentally-derived fragment spectrum 16 to identify 
one or more of the predicted mass spectra which most closely 
match the experimentally-derived fragment spectrum 16. 
Preferably the comparison is done automatically by calculating 
a closeness-of-fit measure for each of the plurality of
predicted mass spectra 22 (compared to the
experimentally-derived fragment spectrum 16). It is believed that, in
general, there is high probability that the peptide analyzed
by the tandem mass spectrometer has an amino acid sequence
identical to one of the sub-sequences taken from the protein
sequence library 20 which resulted in a predicted mass
spectrum 22 exhibiting a high closeness-of-fit with respect to
the experimentally-derived fragment spectrum 16. Furthermore,
when the peptide analyzed by the tandem mass spectrometer 14
was derived from a protein, it is believed there is a high
probability that the parent protein is identical or similar to
the protein whose sequence in the protein sequence library 20
includes a sub-sequence that resulted in a predicted mass
spectrum 22 which had a high closeness-of-fit with respect to
the fragment spectrum 16. Preferably, the entire procedure
can be performed automatically using, e.g., a computer to
calculate predicted mass spectra 22 and/or to perform 
comparison 24 of the predicted mass spectra 22 with the 
experimentally-derived fragment spectrum 16. 

Fig. 3 is a flow diagram showing one method for 
predicting mass spectra 22 and performing the comparison 24. 
According to the method of Fig. 3, the experimentally-derived 
fragment spectrum 16 is first normalized 32. According to one 
normalization method, the experimentally-derived fragment 
spectrum 16 is converted to a list of masses and intensities. 
The values for the precursor ion are removed from the file. 
The square root of all the intensity values is calculated and 
normalized to a maximum intensity of 100. The 200 most 
intense ions are divided into ten mass regions and the maximum 
intensity is normalized to 100 within each region. Each ion 
which is within 3.0 daltons of its neighbor on either side is 
given the greater intensity value, if a neighboring intensity 
is greater than its own intensity. Of course, other 
normalizing methods can be used and it is possible to perform 
analysis without performing normalization, although 
normalization is, in general, preferred. For example, it is 
possible to use maximum intensities with a value greater than 
or less than 100. It is possible to select more or fewer than 
the 200 most intense ions. It is possible to divide into more 
or fewer than 10 mass regions. It is possible to use a
neighbor window (within which an ion takes on a greater
neighboring intensity value) that is wider or narrower
than 3.0 daltons.
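
The normalization steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the embodiment: the function name, the precursor exclusion window of 1.0 dalton, and the choice to split the ten mass regions evenly across the observed m/z range are all hypothetical.

```python
import math

def normalize_spectrum(peaks, precursor_mz, precursor_window=1.0,
                       top_n=200, n_regions=10, neighbor_window=3.0):
    """peaks: list of (m/z, intensity). Returns a list of (m/z, intensity)."""
    # Remove the precursor ion (the exclusion window is an assumption).
    kept = [(mz, i) for mz, i in peaks if abs(mz - precursor_mz) > precursor_window]
    if not kept:
        return []
    # Square root of every intensity, scaled to a maximum of 100.
    rooted = [(mz, math.sqrt(i)) for mz, i in kept]
    top = max(i for _, i in rooted)
    if top <= 0:
        return []
    scaled = [(mz, 100.0 * i / top) for mz, i in rooted]
    # Keep the top_n most intense ions, then order them by m/z.
    strongest = sorted(scaled, key=lambda p: p[1], reverse=True)[:top_n]
    strongest.sort()
    # Split into equal-width mass regions (assumed) and rescale each to 100.
    lo, hi = strongest[0][0], strongest[-1][0]
    width = (hi - lo) / n_regions or 1.0
    regions = [[] for _ in range(n_regions)]
    for mz, i in strongest:
        regions[min(int((mz - lo) / width), n_regions - 1)].append((mz, i))
    rescaled = []
    for region in regions:
        if region:
            rmax = max(i for _, i in region)
            rescaled.extend((mz, 100.0 * i / rmax) for mz, i in region)
    # An ion takes the larger intensity of an adjacent ion within the window.
    result = []
    for k, (mz, i) in enumerate(rescaled):
        best = i
        for j in (k - 1, k + 1):
            if 0 <= j < len(rescaled) and abs(rescaled[j][0] - mz) <= neighbor_window:
                best = max(best, rescaled[j][1])
        result.append((mz, best))
    return result
```

Passing a different `top_n`, `n_regions`, or `neighbor_window` corresponds to the variations mentioned in the text.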

In order to generate predicted mass spectra from a 
protein sequence library, according to the process of Fig. 3,
sub-sequences within each protein sequence are identified 
which have a mass which is within a tolerance amount of the 
mass of the unknown peptide. As noted above, the mass of the 
unknown peptide is known from the tandem mass spectrometer 34. 
Identification of candidate sub-sequences 34 is shown in 
greater detail in Fig. 4. In general, the process of 
identifying candidate sub-sequences involves summing the 
masses of linear amino acid sequences until the sum is either 
within a tolerance of the mass of the unknown peptide (the 
"target" mass) or has exceeded the target mass (plus 
tolerance). If the mass of the sequence is within tolerance
of the target mass, the sequence is marked as a candidate. If 
the mass of the linear sequence exceeds the mass of the 
unknown peptide, then the algorithm is repeated, beginning 
with the next amino acid position in the sequence. 

According to the method of Fig. 4, a variable m, 
indicating the starting amino acid along the sequence is 
initialized to 0 and incremented by 1 (36, 38). The sum, 
representing the cumulative mass and a variable n representing 
the number of amino acids thus far calculated in the sum, are 
initially set to 0 (40) and variable n is incremented 42. The
molecular weight of a peptide corresponding to a sub-sequence 
of a protein sequence is calculated in iterative fashion by 
steps 44 and 46. In each iteration, the sum is incremented by
the molecular weight of the next (nth) amino
acid in the sequence 44. Values of the sum 44 may be stored
for use in calculating fragment masses for use in predicting a 
fragment mass spectrum as described below. If the resulting 
sum is less than the target mass decremented by a tolerance 
46, the value of n is incremented 42 and the process is 
repeated 44. A number of tolerance values can be used. In
one embodiment, a tolerance value of ±0.05% of the mass of the
unknown peptide was used. If the new sum is no longer less
than a tolerance amount below the target mass, it is then
determined if the new sum is greater than the target mass plus 
the tolerance amount. If the new sum is more than the 
tolerance amount in excess of the target mass, this particular 
sequence is not considered a candidate sequence and the 
process begins again, starting from a new starting point in 
the sequence (by incrementing the starting point value m 
(38)). If, however, the sum is not greater than the target 
mass plus the tolerance amount, it is known that the sum is 
within one tolerance amount of a target mass and, thus, that 
the sub-sequence beginning with the mth amino acid and extending
to the (m + n)th amino acid of the sequence is a candidate sequence.
The candidate sequence is marked, e.g., by storing the values 
of m and n to define this sub-sequence. 
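
The search of Fig. 4 can be sketched as a pair of nested loops. The sketch below is illustrative only: the function names are hypothetical, the residue masses are standard monoisotopic values rather than values taken from this document, and adding one water to the summed residues to obtain the peptide mass is an assumption about how the target mass is defined.

```python
# Standard monoisotopic residue masses in daltons (not taken from the text).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def peptide_mass(sequence):
    """Neutral peptide mass: summed residues plus one water (assumed)."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER

def find_candidates(protein, target_mass, tolerance_fraction=0.0005):
    """Return (start, end) slices of protein whose mass is within tolerance."""
    tol = target_mass * tolerance_fraction  # 0.05% of the target mass
    candidates = []
    for m in range(len(protein)):           # starting position
        total = WATER
        for n in range(m, len(protein)):    # extend one residue at a time
            total += RESIDUE_MASS[protein[n]]
            if total < target_mass - tol:
                continue                    # still too light, keep summing
            if total <= target_mass + tol:
                candidates.append((m, n + 1))
            break                           # in range or too heavy: next start
    return candidates
```

For example, `find_candidates("GASPVGAS", peptide_mass("ASP"))` marks the slice `(1, 4)`, i.e. the sub-sequence `ASP`. Note that mass alone cannot distinguish permutations of the same residues, which is why every candidate goes on to spectrum prediction and scoring.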

Returning to Fig. 3, once a plurality of candidate 
sub-sequences have been identified, a fragment mass spectrum 
is predicted for each of the candidate sequences 52. The 
fragment mass spectrum is predicted by calculating the 
fragment ion masses for the type b- and y-ions for the amino
acid sequence. When a peptide is fragmented and the charge is
retained on the N-terminal cleavage fragment, the resulting
ion is labelled as a b-type ion. If the charge is retained on
the C-terminal fragment, it is labelled a y-type ion.
Masses for type b-ions were calculated by summing the amino
acid masses and adding the mass of a proton. Type y-ions
were calculated by summing, from the C-terminus, the masses of
the amino acids and adding the mass of water and a proton to
the initial amino acid. In this way, it is possible to
calculate an m/z for each fragment. However, in order to 
provide a predicted mass spectrum, it is also necessary to 
assign an intensity value for each fragment. It might be 
possible to predict, on a theoretical basis, an intensity value
for each fragment. However, this procedure is difficult. It 
has been found useful to assign intensities in the following 
fashion. The value of 50.0 is assigned to each b and y ion. 
To masses of 1 dalton on either side of the fragment ion, an 
intensity of 25.0 is assigned. Peak intensities of 10.0 are
assigned at 17.0 and 18.0 daltons below the m/z of each b- and
y-ion location (for both NH3 and H2O loss), and peak intensities
of 10.0 are assigned at 28.0 amu below each type b-ion location
(for type a-ions).
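
A minimal sketch of this prediction step follows. The function names are hypothetical, the residue, water, and proton masses are standard monoisotopic values rather than values from this document, and binning the predicted peaks to unit masses (keeping the largest intensity per bin) is an assumption made for illustration.

```python
# Standard monoisotopic masses in daltons (not taken from the text).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728

def b_y_ions(peptide):
    """Singly charged b- and y-ion m/z values, shortest fragment first."""
    b_ions, y_ions = [], []
    b, y = PROTON, WATER + PROTON
    for k in range(1, len(peptide)):
        b += RESIDUE_MASS[peptide[k - 1]]   # grow from the N-terminus
        y += RESIDUE_MASS[peptide[-k]]      # grow from the C-terminus
        b_ions.append(b)
        y_ions.append(y)
    return b_ions, y_ions

def predict_spectrum(peptide):
    """Map unit-mass bin -> assigned intensity for a candidate peptide."""
    spectrum = {}

    def put(mz, intensity):
        key = round(mz)
        if key > 0:
            spectrum[key] = max(spectrum.get(key, 0.0), intensity)

    b_ions, y_ions = b_y_ions(peptide)
    for ion in b_ions + y_ions:
        put(ion, 50.0)             # the b- or y-ion itself
        put(ion - 1.0, 25.0)       # 1 dalton on either side
        put(ion + 1.0, 25.0)
        put(ion - 17.0, 10.0)      # NH3 loss
        put(ion - 18.0, 10.0)      # H2O loss
    for ion in b_ions:
        put(ion - 28.0, 10.0)      # a-ion (CO loss), b-ions only
    return spectrum
```

For the dipeptide `GA`, this gives a b-ion near m/z 58.03 and a y-ion near m/z 90.05, each with intensity 50.0 and the satellite peaks described above.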

Returning to Fig. 3, after calculation of predicted 
m/z values and assignment of intensities, it is preferred to 
calculate a measure of closeness-of-fit between the predicted
mass spectra 22 and the experimentally-derived fragment
spectrum 16. A number of methods for calculating
closeness-of-fit are available. In the embodiment depicted in Fig. 3, a
two-step method is used 54. The two-step method includes
calculating a preliminary closeness-of-fit score, referred to
here as S_p 56 and, for the highest-scoring amino acid
sequences, calculating a correlation function 58. According
to one embodiment, S_p is calculated using the following
formula:

S_p = \left( \sum i_m \right) \cdot n_i \cdot (1 + \beta) \cdot (1 - \rho) / n_t    (1)

where i_m = matched intensities, n_i = number of matched
fragment ions, β = type b- and y-ion continuity, ρ = presence
of immonium ions and their respective amino acids in the
predicted sequence, n_t = total number of fragment ions. The
factor β evaluates the continuity of a fragment ion series.
If there was a fragment ion match for the ion immediately
preceding the current type b- or y-ion, β is incremented by
0.075 (from an initial value of 0.0). This increases the
preliminary score for those peptides matching a successive 
series of type b- and y-ions since extended series of ions of 
the same type are often observed in MS/MS spectra. The factor
ρ evaluates the presence of immonium ions in the low mass end
of the mass spectrum. Immonium ions are diagnostic for the
presence of some types of amino acids in the sequence. If
immonium ions are present at 110.0, 120.0, or 136.0 Da (± 1.0
Da) in the processed data file of the unknown peptide with
normalized intensities greater than 40.0, indicating the
presence of histidine, phenylalanine, and tyrosine
respectively, then the sequence under evaluation is checked
for the presence of the amino acid indicated by the immonium
ion. The preliminary score, S_p, for the peptide is either
augmented or depreciated by a factor of (1 - ρ) where ρ is the
sum of the penalties for each of the three amino acids whose
presence is indicated in the low mass region. Each individual
ρ can take on the value of -0.15 if there is a corresponding
low mass peak and the amino acid is not present in the
sequence, +0.15 if there is a corresponding low mass peak and
the amino acid is present in the sequence, or 0.0 if the low
mass peak is not present. The total penalty can range from
-0.45 (all three low mass peaks present in the spectrum yet
none of the three amino acids are in the sequence) to +0.45
(all three low mass peaks are present in the spectrum and all
three amino acids are in the sequence).
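
The way these quantities combine can be sketched as below. This is a hedged illustration, not the embodiment's code: it assumes the preliminary score has the form (sum of matched intensities) x (number of matched ions) x (1 + β) x (1 - ρ) / (total number of predicted ions), which is consistent with the factor definitions given here, and it takes ρ as a precomputed input. The function name and the input layout are hypothetical.

```python
def preliminary_score(series, rho=0.0, beta_step=0.075):
    """Preliminary closeness-of-fit score under the assumed formula.

    series: one list per ion series (e.g. b-ions, y-ions), in sequence
            order; each entry is the matched experimental intensity, or
            None when the predicted ion had no match.
    rho:    immonium ion factor, computed separately.
    """
    matched_intensity = 0.0
    n_matched = 0
    n_total = 0
    beta = 0.0
    for ions in series:
        previous_matched = False
        for intensity in ions:
            n_total += 1
            matched = intensity is not None
            if matched:
                matched_intensity += intensity
                n_matched += 1
                if previous_matched:
                    beta += beta_step   # reward consecutive matches
            previous_matched = matched
    if n_total == 0:
        return 0.0
    return matched_intensity * n_matched * (1.0 + beta) * (1.0 - rho) / n_total
```

With a b-series of `[50.0, 50.0, None]` and a y-series of `[None, 25.0, 25.0]`, four of six ions match, each series contributes one consecutive pair (β = 0.15), and the score with ρ = 0 is 150 x 4 x 1.15 / 6 = 115.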

Following the calculation of the preliminary 
closeness-of-fit score S_p, those candidate predicted mass
spectra having the highest S_p scores are selected for further
analysis using the correlation function 58. The number of
candidate predicted mass spectra which are selected for
further analysis will depend largely on the computational
resources and time available. In one embodiment, 300
candidate peptide sequences with the highest preliminary score 
were selected. 

For purposes of calculating the correlation 
function 58, the experimentally-derived fragment spectrum is
preprocessed in a fashion somewhat different from
preprocessing 32 employed before calculating S_p. For purposes
of the correlation function, the precursor ion was removed 
from the spectrum and the spectrum divided into 10 sections. 
Ions in each section were then normalized to 50.0. The 
sectionwise normalized spectra 60 were then used for 
calculating the correlation function. According to one 
embodiment, the discrete correlation between the two functions 
is calculated as: 

R(t) = \sum_{i=0}^{n-1} x(i) \, y(i+t)    (2)

where t is a lag value. The discrete correlation theorem
states that the discrete correlation of two real functions x
and y is one member of the discrete Fourier transform pair

R(t) \Longleftrightarrow X(t) \, Y^{*}(t)    (3)


where X(t) and Y(t) are the discrete Fourier transforms of 
x(i) and y(i) and the Y* denotes complex conjugation. 
Therefore, the cross-correlations can be computed by Fourier 
transformation of the two data sets using the fast Fourier 
transform (FFT) algorithm, multiplication of one transform by 
the complex conjugate of the other, and inverse transformation 
of the resulting product. In one embodiment, all of the 
predicted spectra as well as the pre-processed unknown 
spectrum were zero-padded to 4096 data points since the MS/MS 
spectra are not periodic (as intended by the correlation 
theorem) and the FFT algorithm requires N to be an integer 
power of two, so the resulting end effects need to be 
considered. The final score attributed to each candidate 
peptide sequence is R(0) minus the mean of the 
cross-correlation function over the range -75<t<75. This
modified "correlation parameter" described in Powell and
Hieftje, Anal. Chim. Acta, Vol. 100, pp. 313-327 (1978) shows
better discrimination over just the spectral correlation
coefficient R(0). The raw scores are normalized to 1.0. In
one embodiment, output 62 includes the normalized raw score, 
the candidate peptide mass, the unnormalized correlation 
coefficient, the preliminary score, the fragment ion
continuity β, the immonium ion factor ρ, the number of type b-
and y-ions matched out of the total number of fragment ions,
their matched intensities, the protein accession number, and 
the candidate peptide sequence. 
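
The final score can be illustrated with a short sketch. Note the substitution: for brevity this computes the cross-correlation by direct summation rather than by FFT, which gives the same values at the lags of interest when the spectra are treated as zero outside their range (the role played by zero-padding in the FFT approach). The function names are hypothetical, and the spectra are assumed to be equal-length lists of intensities indexed by unit mass.

```python
def cross_correlation(x, y, lag):
    """R(lag) = sum over i of x[i] * y[i + lag], with y taken as zero
    outside its index range."""
    return sum(x[i] * y[i + lag]
               for i in range(len(x))
               if 0 <= i + lag < len(y))

def correlation_score(x, y, max_lag=75):
    """R(0) minus the mean of R(t) over -max_lag < t < max_lag."""
    lags = range(-max_lag + 1, max_lag)
    mean = sum(cross_correlation(x, y, t) for t in lags) / len(lags)
    return cross_correlation(x, y, 0) - mean
```

Subtracting the mean penalizes spectra that correlate equally well at many lags, which is what distinguishes this score from R(0) alone. For spectra of a few thousand points and many candidates, an FFT-based computation would be the faster choice.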

If desired, the correlation function 58 can be used 
to automatically select one of the predicted mass spectra 22 
as corresponding to the experimentally-derived fragment
spectrum 16. Preferably, however, a number of sequences from 
the library 20 are output and final selection of a single 
sequence is done by a skilled operator. 

In addition to predicting mass spectra from protein 
sequence libraries, the present invention also includes 
predicting mass spectra based on nucleotide databases. The 
procedure involves the same algorithmic approach of cycling 
through the nucleotide sequence. The 3-base codons will be 
converted to a protein sequence and the mass of the amino 
acids summed in a fashion similar to the summing depicted in 
Fig. 4. To cycle through the nucleotide sequence, a 1-base 
increment will be used for each cycle. This will allow the 
determination of an amino acid sequence for each of the three 
reading frames in one pass. The scoring and reporting 
procedures for the search can be the same as that described 
above for the protein sequence database. 
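
The 1-base cycling can be sketched as follows. This is an illustration under assumptions: the standard genetic code is used (the text does not name a translation table), only the forward strand is translated, the input is upper-case DNA containing only A, C, G and T, and the function name is hypothetical. Stop codons are emitted as `*`; a mass-summing loop would restart at each one.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate_three_frames(dna):
    """Translate all three forward reading frames in a single pass.

    The window advances one base per cycle; the codon starting at
    position pos belongs to reading frame pos % 3.
    """
    frames = [[], [], []]
    for pos in range(len(dna) - 2):
        frames[pos % 3].append(CODON_TABLE[dna[pos:pos + 3]])
    return ["".join(frame) for frame in frames]
```

For example, `translate_three_frames("ATGGCC")` yields `["MA", "W", "G"]`: frame 0 reads ATG GCC, frame 1 reads TGG, and frame 2 reads GGC.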

Depending on the computing and time resources 
available, it may be advantageous to employ data-reduction 
techniques. Preferably these techniques will emphasize the 
most informative ions in the spectrum while not unduly 
affecting search speed. One technique involves considering
only some of the fragment ions in the MS/MS spectrum. A
spectrum for a peptide may contain as many as 3,000 fragment
ions. According to one data reduction strategy, the ions are
ranked by intensity and some fraction of the most intense ions 
(e.g., the top 200 most intense ions) will be used for 
comparison. Another approach involves subdividing the 
spectrum into, e.g., 4 or 5 regions and using the 50 most 
intense ions in each region as part of the data set. Yet 
another approach involves selecting ions based on the 
probability of those ions being sequence ions. For example, 
ions could be selected which exist in mass windows of 57 
through 186 daltons (range of mass increments for the 20 
common amino acids from GLY to TRP) that contain diagnostic
features of type b- or y-ions, such as losses of 17 or 18
daltons (NH3 or H2O) or a loss of 28 daltons (CO).
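
The region-based reduction strategy can be sketched in a few lines. The function name is hypothetical, and dividing the observed m/z range into equal-width regions is an assumption, since the text does not say how the regions are drawn.

```python
def top_peaks_by_region(peaks, n_regions=5, per_region=50):
    """Keep the per_region most intense peaks in each of n_regions
    equal-width m/z regions. peaks: list of (m/z, intensity)."""
    if not peaks:
        return []
    lo = min(mz for mz, _ in peaks)
    hi = max(mz for mz, _ in peaks)
    width = (hi - lo) / n_regions or 1.0
    buckets = [[] for _ in range(n_regions)]
    for mz, intensity in peaks:
        index = min(int((mz - lo) / width), n_regions - 1)
        buckets[index].append((mz, intensity))
    kept = []
    for bucket in buckets:
        bucket.sort(key=lambda p: p[1], reverse=True)
        kept.extend(bucket[:per_region])
    return sorted(kept)
```

Compared with a single global top-200 cut, selecting per region keeps weaker but informative ions in sparsely populated parts of the mass range.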



WO 95/25281    PCT/US95/03239

The techniques described above are, in general, 
applicable to spectra of peptides with charged states of +1 or 
+2, typically having a relatively short amino acid sequence. 
Using a longer amino acid sequence increases the probability 
of a unique match to a protein sequence. However, longer 
peptide sequences have a greater likelihood of containing more 
basic amino acids, and thus producing ions of higher charge 
state under electrospray ionization conditions. According to
one embodiment of the invention, algorithms are provided for
searching a database with MS/MS spectra of highly charged
peptides (+3, +4, +5, etc.). According to one approach, the
search program will include an input for the charge state (N)
of the precursor ion used in the MS/MS analysis. Predicted
fragment ions will be generated for each charge state less
than N. Thus, for a peptide of +4, the charge states of +1, +2
and +3 will be generated for each fragment ion and compared to 
the MS/MS spectrum. 

A second strategy for use with multiply charged
spectra is the use of mathematical deconvolution to convert 
the multiply charged fragment ions to their singly charged 
masses. The deconvoluted spectrum will then contain the 
fragment ions for the multiply charged fragment ions and their 
singly charged counterparts. 
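
The arithmetic behind both strategies is the conversion between a singly charged m/z and the m/z of the same fragment at a higher charge. The sketch below assumes each added charge is an added proton (standard proton mass, not a value from this document) and that the charge of an observed peak is already known when converting back; the function names are hypothetical.

```python
PROTON = 1.00728  # standard proton mass in daltons (not taken from the text)

def fragment_charge_series(singly_charged_mz, precursor_charge):
    """m/z of one fragment at every charge state below the precursor's.

    Returns {charge: m/z}. For a +4 precursor this covers +1, +2, +3.
    """
    return {
        z: (singly_charged_mz + (z - 1) * PROTON) / z
        for z in range(1, precursor_charge)
    }

def to_singly_charged(mz, charge):
    """Convert an m/z observed at the given charge to its +1 equivalent."""
    return mz * charge - (charge - 1) * PROTON
```

For a fragment at m/z 1000.0 when singly charged and a +4 precursor, the predicted peaks fall at about 1000.0, 500.50 and 334.00.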

To speed up searches of the database, a directed- 
search approach can be used. In cases where experiments are 
performed on specific organisms or specific types of proteins, 
it is not necessary to search the entire database on the first 
pass. Instead, a search of protein sequences specific to a
species or a class of proteins can be performed first. If the
search does not provide reasonable answers, then the entire 
database is searched. 

A number of different scoring algorithms can be 
used for determining preliminary closeness of fit or 
correlation. In addition to scoring based on the number of 
matched ions multiplied by the sum of the intensity, scoring 
can be based on the percentage of continuous sequence coverage 
represented by the sequence ions in the spectrum. For 
example, a 10 residue peptide will potentially contain 9 each
of b- and y-type sequence ions. If a set of ions extends from
B1 to B9, then a score of 100 is awarded, but if a
discontinuity is observed in the middle of the sequence, such
as missing the B5 ion, a penalty is assessed. The maximum
score is awarded for an amino acid sequence that contains a 
continuous ion series in both the b and y directions. 

In the event the described scoring procedures do 
not delineate an answer, an additional technique for spectral 
comparison can be used in which the database is initially 
searched with a molecular weight value and a reduced set of 
fragment ions. Initial filtering of the database occurs by 
matching sequence ions and generating a score with one of the 
methods described above. The resulting set of answers will 
then undergo a more rigorous inspection process using a 
modified full MS/MS spectrum. For the second stage analysis,
one of several spectral matching approaches developed for 
spectral library searching is used. This will require 
generating a "library spectrum" for the peptide sequence based 
on the sequence ions predicted for that amino acid sequence. 
Intensity values for sequence ions of the "library spectrum" 
will be obtained from the experimental spectrum. If a 
fragment ion is predicted at m/z 256, then the intensity value 
for the ion in the experimental spectrum at m/z=256 will be 
used as the intensity of the ion in the predicted spectrum. 
Thus, if the predicted spectrum is identical to the "unknown" 
spectrum, it will represent an ideal spectrum. The spectra 
will then be compared using a correlation function. In 
general, it is believed that the majority of computational 
time for the above procedure is spent in the iterative search 
process. By multiplexing the analysis of multiple MS/MS 
spectra in one pass through the database, an overall 
improvement in efficiency will be realized. In addition, the 
mass tolerance used in the initial pre-filtering can affect
search times by increasing or decreasing the number of
sequences to analyze in subsequent steps. Another approach to
speed up searches involves a binary encoding scheme where
the mass spectrum is encoded as peak/no peak at every mass
depending on whether the peak is above a certain threshold
value. If intensive use of a protein sequence library is
contemplated, it may be possible to calculate and store
predicted mass values of all sub-sequences within a
predetermined range of masses so that at least some of the
analysis can be performed by table look-up rather than 
calculation. 
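
The peak/no peak encoding mentioned above can be sketched with an integer used as a bit vector, one bit per unit mass. The function names, the unit-mass rounding, and the upper mass limit are assumptions for illustration.

```python
def encode_peaks(peaks, threshold, max_mass=4096):
    """Encode a spectrum as an integer with bit m set when a peak above
    the threshold exists at unit mass m. peaks: list of (m/z, intensity)."""
    bits = 0
    for mz, intensity in peaks:
        mass = round(mz)
        if intensity > threshold and 0 <= mass <= max_mass:
            bits |= 1 << mass
    return bits

def shared_peak_count(encoded_a, encoded_b):
    """Number of unit masses at which both spectra have a peak."""
    return bin(encoded_a & encoded_b).count("1")
```

Comparing two encoded spectra then costs one bitwise AND and a bit count, which makes it a cheap filter ahead of the more expensive scoring steps.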

Figs. 6A-6E are flow charts showing an analysis 
procedure according to one embodiment of the present 
invention. After data is acquired from the tandem mass 
spectrometer, as described above 602, the data is saved to a 
file and converted to an ASCII format 604. At this point, a 
preprocessing procedure is started 606. The user enters
information regarding the peptide mass and the precursor ion
charge state 608. Mass/intensity values are loaded from the
ASCII file, with the values being rounded to unit masses 610.
The previously-identified precursor ion contribution of this 
data is removed 612. The remaining data is normalized to a 
maximum intensity of 100 614. At this point, different paths 
can be taken. In one case, the presence of any immonium ions 
(H, F and Y) is noted 616 and the peptide mass and immonium 
ion information is stored in a datafile 618. In another 
route, the 200 most intense peaks are selected 620. If two 
peaks are within a predetermined distance (e.g., 2 amu) of 
each other, the lower intensity peak is set equal to the greater
intensity 622. After this procedure, the data is stored in a
datafile for preliminary scoring 624. In another route, the 
data is divided into a number of windows, for example ten 
windows 626. Normalization is performed within each window, 
for example, normalizing to a maximum intensity of 50 628. 
This data is then stored in a datafile for final correlation 
scoring 630. This ends the preprocessing phase, according to 
this embodiment 632. 
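
The preprocessing routes of Fig. 6A can be sketched as follows. This
is an illustrative outline only: spectra are assumed to be
dictionaries of unit m/z to intensity, the width of the region
removed around the precursor ion (half_width) is an assumption since
the text does not state it, and all function names are hypothetical.

```python
def remove_precursor(peaks, precursor_mz, half_width=2):
    """Drop peaks near the precursor ion (the half_width is an assumed value)."""
    return {mz: i for mz, i in peaks.items() if abs(mz - precursor_mz) > half_width}


def normalize(peaks, maximum):
    """Scale intensities so that the largest equals `maximum`."""
    top = max(peaks.values(), default=0.0)
    if top == 0.0:
        return dict(peaks)
    return {mz: i * maximum / top for mz, i in peaks.items()}


def most_intense(peaks, count=200):
    """Keep only the `count` most intense peaks."""
    keep = sorted(peaks, key=peaks.get, reverse=True)[:count]
    return {mz: peaks[mz] for mz in keep}


def level_neighbors(peaks, distance=2):
    """Raise each peak to the greatest intensity found within `distance` of it."""
    leveled = dict(peaks)
    for mz in peaks:
        for other, intensity in peaks.items():
            if other != mz and abs(other - mz) <= distance:
                leveled[mz] = max(leveled[mz], intensity)
    return leveled


def normalize_windows(peaks, windows=10, maximum=50.0):
    """Split the mass range into equal windows and normalize within each."""
    if not peaks:
        return {}
    low, high = min(peaks), max(peaks)
    width = (high - low) / windows or 1.0
    grouped = {}
    for mz, intensity in peaks.items():
        index = min(int((mz - low) / width), windows - 1)
        grouped.setdefault(index, {})[mz] = intensity
    result = {}
    for window in grouped.values():
        result.update(normalize(window, maximum))
    return result


# Hypothetical spectrum with the precursor ion at m/z 500.
raw = {100: 10.0, 101: 40.0, 300: 20.0, 500: 200.0, 700: 5.0}
base = normalize(remove_precursor(raw, precursor_mz=500), 100.0)
preliminary = level_neighbors(most_intense(base, 200), distance=2)  # for preliminary scoring
final = normalize_windows(base, windows=10, maximum=50.0)           # for correlation scoring
```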

The database search is started 634 and the search 
parameters and the data obtained from the preprocessing 
procedure (Fig. 6A) are loaded 636. A first batch of database
sequences is loaded 638 and a search procedure is run on a
particular protein 640. The search procedure is detailed in
Fig. 6C. As long as the end of the batch has not been reached,
the index is incremented 642 and the search routine is
repeated 640. Once it is determined that the end of a batch
has been reached 644, as long as the end of the database has
not been reached, the second index 646 is incremented and a
new batch of database sequences is loaded 638. Once the end
of the database has been reached 628, a correlation analysis
is performed 630 (as detailed in Fig. 6E), the results are
printed 632 and the procedure ends 634. 

When the search procedure is started 638 (Fig. 6C),
an index I1 is set to zero 646 to indicate the start position
of the candidate peptide within the amino acid sequence being
searched 640. A second index I2, indicating the end position of
the candidate peptide within the amino acid sequence being
searched, is initially set equal to I1 and the variable Pmass,
indicating the accumulated mass of the candidate peptide, is
initialized to zero 648. During each iteration through a given
candidate peptide 650 the mass of the amino acid at position I2
is added to Pmass 652. It is next determined whether the mass
thus-far accumulated (Pmass) equals the input mass (i.e., the
mass of the peptide) 654. In some embodiments, this test may be
performed as plus or minus a tolerance rather than requiring
strict equality, as noted above. If there is equality
(optionally within a tolerance) an analysis routine is started
656 (detailed in Fig. 6D). Otherwise, it is determined
whether Pmass is less than the input mass (optionally within a
tolerance). If so, the index I2 is incremented 658 and the
mass of the amino acid at the next position (the incremented
I2 position) is added to Pmass 652. If Pmass is greater than
the input mass (optionally by more than a tolerance 660) it is
determined whether index I1 is at the end of a protein 662.
If so, the search routine exits 664. Otherwise, index I1 is
incremented 666 so that the routine can start with a new start 
position of a candidate peptide and the search procedure 
returns to block 648. 
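
The two-index search of Fig. 6C can be sketched as below. The sketch
assumes average residue masses and starts the accumulated mass at
the mass of water so that it can be compared with the mass of the
neutral peptide; the text initializes Pmass to zero, so this offset,
the tolerance value and the function names are assumptions.

```python
# Average residue masses (standard values), keyed by one-letter code.
AVERAGE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153


def peptide_mass(sequence):
    """Average mass of a neutral peptide."""
    return sum(AVERAGE_MASS[aa] for aa in sequence) + WATER


def candidate_peptides(protein, input_mass, tolerance=1.0):
    """Yield (I1, I2, sub-sequence) for every linear stretch whose mass
    lies within the tolerance of the input mass."""
    for i1 in range(len(protein)):            # start position I1
        pmass = WATER                         # accumulated mass Pmass
        for i2 in range(i1, len(protein)):    # end position I2
            pmass += AVERAGE_MASS[protein[i2]]
            if abs(pmass - input_mass) <= tolerance:
                yield i1, i2, protein[i1:i2 + 1]   # hand over to the analysis routine
            elif pmass > input_mass + tolerance:
                break                         # too heavy: advance I1


# Hypothetical protein containing the peptide DLRSWTA.
protein = "GGDLRSWTAGG"
hits = list(candidate_peptides(protein, peptide_mass("DLRSWTA"), tolerance=0.5))
```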

When the analysis procedure is started 670 (Fig. 6D),
data indicative of b- and y- ions for the candidate
peptide are generated 672, as described above. It is
determined whether the peak is within the top 200 ions 674.
The peak intensity is summed and the fragment match index is
incremented 676. If previous b- or y- ions are matched 678,
the β index is incremented 680. Otherwise, it is determined
whether all fragment ions have been analyzed. If not, the
fragment index is incremented 684 and the procedure returns to
block 674. Otherwise, a preliminary score such as S_p,
described above, is calculated 686. If the newly-calculated S_p
is greater than the lowest score 688 the peptide sequence is
stored 690 unless the sequence has already been stored, in
which case the procedure exits 692. 
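
The matching step of Fig. 6D can be sketched as below: singly
charged b- and y-ion m/z values are generated for a candidate
peptide and looked up among the retained peaks. The sketch returns
the summed matched intensity, the number of matched ions, the number
of matches whose preceding ion in the same series also matched, and
the number of predicted ions; it does not compute the preliminary
score itself. Matching at unit mass and all names are assumptions.

```python
AVERAGE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [AVERAGE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for mass in masses[:-1]:              # N-terminal fragments
        running += mass
        b_ions.append(running + PROTON)
    running = 0.0
    for mass in reversed(masses[1:]):     # C-terminal fragments
        running += mass
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def match_fragments(peptide, peaks):
    """peaks: unit m/z -> intensity (for example the 200 most intense ions)."""
    b_ions, y_ions = fragment_ions(peptide)
    summed, matched, consecutive = 0.0, 0, 0
    for series in (b_ions, y_ions):
        previous_hit = False
        for mz in series:
            intensity = peaks.get(round(mz))
            if intensity is not None:
                summed += intensity
                matched += 1
                if previous_hit:
                    consecutive += 1
            previous_hit = intensity is not None
    return summed, matched, consecutive, len(b_ions) + len(y_ions)


# Hypothetical peak list for the tripeptide GAS.
result = match_fragments("GAS", {58: 10.0, 129: 20.0, 177: 5.0})
```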

At the beginning of the correlation analysis (Fig. 6E),
a stored candidate peptide is selected 693. A
theoretical spectrum for the candidate peptide is created 694, 
correlated with experimental data 695 and a final correlation 
score is obtained 696, as described above. The index is 
incremented 697 and the process repeated from block 693 unless 
all candidate peptides have been scored 698, in which case the 
correlation analysis procedure exits 699. 
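
The loop of Fig. 6E can be sketched as below. The correlation
function itself is the one described earlier in this document and is
not reproduced in this passage; the form used here (correlation at
zero offset minus the mean correlation over a range of non-zero
offsets) and the offset range are assumptions chosen for
illustration, as are the function names. Spectra are assumed to be
equal-length lists of intensities indexed by unit mass.

```python
def cross_correlation(x, y, lag):
    """Sum of x[i] * y[i + lag] over the positions where both are defined."""
    return sum(x[i] * y[i + lag] for i in range(len(x)) if 0 <= i + lag < len(y))


def correlation_score(experimental, theoretical, max_lag=75):
    """Zero-offset correlation minus the mean over non-zero offsets (assumed form)."""
    at_zero = cross_correlation(experimental, theoretical, 0)
    lags = [lag for lag in range(-max_lag, max_lag + 1) if lag != 0]
    background = sum(cross_correlation(experimental, theoretical, lag) for lag in lags)
    return at_zero - background / len(lags)


def rank_candidates(experimental, candidates, max_lag=75):
    """Score every stored candidate; return (name, score, score normalized to the best)."""
    scored = sorted(
        ((correlation_score(experimental, spectrum, max_lag), name)
         for name, spectrum in candidates.items()),
        reverse=True,
    )
    best = scored[0][0]
    return [(name, score, score / best if best > 0 else 0.0) for score, name in scored]


# Hypothetical spectra on an 8-point mass axis.
experimental = [0, 50, 0, 0, 50, 0, 0, 0]
candidates = {
    "good": [0, 50, 0, 0, 50, 0, 0, 0],
    "bad": [50, 0, 0, 0, 0, 0, 0, 50],
}
ranked = rank_candidates(experimental, candidates, max_lag=3)
```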

The following examples are offered by way of 
illustration, not limitation. 

Experimental 
Example #1 

MHC complexes were isolated from HS-EBV cells
transformed with HLA-DRB*0401 using antibody affinity
chromatography. Bound peptides were released and isolated by
filtration through a Centricon 10 spin column. The heavy
chain of glycoasparaginase from human leukocytes was isolated.
Proteolytic digestions were performed by dissolving the
protein in 50 mM ammonium bicarbonate containing 10 mM Ca++,
pH 8.6. Trypsin was added in a ratio of 100/1 protein/enzyme.

Analysis of the resulting peptide mixtures was 
performed by LC-MS and LC-MS/MS. Briefly, molecular weights
of peptides were recorded by scanning Q3 or Q1 at a rate of
400 Da/sec over a mass range of 300 to 1600 throughout the
HPLC gradient. Sequence analysis of peptides was performed
during a second HPLC analysis by selecting the precursor ion
with a 6 amu (FWHH) wide window in Q1 and passing the ions
into a collision cell filled with argon to a pressure of 3-5
mtorr. Collision energies were on the order of 20 to 50 eV.
The fragment ions produced in Q2 were transmitted to Q3 and a
mass range of 50 Da to the molecular weight of the precursor
ion was scanned at 500 Da/sec to record the fragment ions.
The low energy spectra of 36 peptides were recorded and stored
on disk. The genpept database contains protein sequences 
translated from nucleotide sequences. A text search of the 
database was performed to determine if the sequences for the 
peptide amino acid sequences used in the analysis were present 
in the database. Subsequently, a second database was created 
from the whole database by appending amino acid sequences for 
peptides not included. 

The spectrum data was converted to a list of masses 
and intensities and the values for the precursor ion were 
removed from the file. The square root of all the intensity 
values was calculated and normalized to a maximum intensity of 
100.0. All ions except the 200 most intense ions were removed 
from the file. The remaining ions were divided into 10 mass 
regions and the maximum intensity normalized to 100.0 within 
each region. Each ion within 3.0 daltons of its neighbor on 
either side was given the greater intensity value, if the 
neighboring intensity was greater than its own intensity. 
This processed data was stored for comparison to the candidate 
sequences chosen from the database search. The MS/MS spectrum 
was modified in a different manner for calculation of a 
correlation function. The precursor ion was removed from the 
spectrum and the spectrum divided into 10 equal sections. 
Ions in each section were then normalized to 50.0. This 
spectrum was used to calculate the correlation coefficient 
against a predicted MS/MS spectrum for each amino acid 
sequence retrieved from the database.

Amino acid sequences from each protein were
generated by summing the masses, using average masses for the
amino acids, of the linear amino acid sequence from the amino
terminus (n). If the mass of the linear sequence exceeded the
mass of the unknown peptide, then the algorithm returned to
the amino terminal amino acid and began summing amino acid
masses from the n+1 position. This process was repeated until
every linear amino acid sequence combination had been
evaluated. When the mass of the amino acid sequence was
within ±0.05% (minimum of ±1 Da) of the mass of the unknown
peptide, the predicted m/z values for the type b- and y-ions
were generated and compared to the fragment ions of the
unknown sequence. A preliminary score (S_p) was generated and
the top 300 candidate peptide sequences with the highest 
preliminary score were ranked and stored. A final analysis of 
the top 300 candidate amino acid sequences was performed with 
a correlation function. Using this function a theoretical 
MS/MS spectrum for the candidate sequence was compared to the 
modified experimental MS/MS spectrum. Correlation 
coefficients were calculated, ranked and reported. The final 
results were ranked on the basis of the normalized correlation 
coefficient. 
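
Two details of this example lend themselves to a short sketch: the
mass tolerance of ±0.05% with a minimum of ±1 Da, and the retention
of only the 300 highest-scoring candidate sequences. The
heap-based container below is one possible way to keep such a list;
the class and function names are hypothetical.

```python
import heapq


def mass_tolerance(mass, fraction=0.0005, minimum=1.0):
    """Tolerance of 0.05% of the mass, but never less than 1 Da."""
    return max(mass * fraction, minimum)


class TopCandidates:
    """Keep the `limit` best-scoring sequences, storing each sequence once."""

    def __init__(self, limit=300):
        self.limit = limit
        self._heap = []      # min-heap of (score, sequence)
        self._seen = set()

    def offer(self, score, sequence):
        if sequence in self._seen:
            return           # already stored
        if len(self._heap) < self.limit:
            heapq.heappush(self._heap, (score, sequence))
            self._seen.add(sequence)
        elif score > self._heap[0][0]:   # better than the lowest stored score
            _, dropped = heapq.heapreplace(self._heap, (score, sequence))
            self._seen.discard(dropped)
            self._seen.add(sequence)

    def ranked(self):
        return sorted(self._heap, reverse=True)


# Hypothetical scores, with the list limited to two entries for brevity.
top = TopCandidates(limit=2)
for score, sequence in [(1.0, "A"), (3.0, "B"), (2.0, "C"), (3.0, "B")]:
    top.offer(score, sequence)
```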

The spectrum shown in Fig. 5 was obtained by
LC-MS/MS analysis of a peptide bound to a DRB*0401 MHC class
II complex. A search of the genpept database containing
74,938 protein sequences identified 384,398 peptides within a
mass tolerance of ±0.05% (minimum of ±1 Da) of the molecular
weight of this peptide. By comparing fragment ion patterns
predicted for each of these amino acid sequences to the
pre-processed MS/MS spectra and calculating a preliminary
score, the number of candidate sequences was cut off at 300. A
correlation analysis was then performed with the predicted
MS/MS spectra for each of these sequences and the modified
experimental MS/MS spectrum. The results of the search
through the genpept database with the spectrum in Fig. 5 are
displayed in Table 1. Two peptides of similar sequence,
DLRSWTAADAAQISK [Seq. ID No. 1] and DLRSWTAADAAQISQ [Seq. ID No.
2], were identified as the highest scoring sequences (C_n
values). Their correlation coefficients are identical so
their rankings in Table 1 are arbitrary. The amino acid
sequence DLRSWTAADAAQISK [Seq. ID No. 1] occurs in five
proteins in the genpept database while the sequence
DLRSWTAADAAQISQ [Seq. ID No. 2] occurs in only one. The top
three sequences appear in immunologically related proteins and
the rest of the proteins appear to have no correlation to one
another. A second search using the same MS/MS spectrum was
performed with the Homo sapiens subset of the genpept database
to compare the results. These data are presented in Table 2.
In both searches the correct sequence tied for the top
position. Both amino acid sequences have identical
correlation coefficients, C_n, although the sequences differ by
Lys and Gln at the C-terminus. These two amino acids have the
same nominal mass and would be expected to produce similar
MS/MS spectra. The sum of the normalized fragment ion
intensities, I_m, for the matched fragment ions differs between
the two peptides, with the correct sequence having the
greater value. The correct sequence also matched an
additional fragment ion in the preliminary scoring procedure,
identifying 70% of the predicted fragment ions for this amino
acid sequence in the pre-processed spectrum. These matches
are determined as part of the preliminary scoring procedure.

Example #2

To examine the complexity of the mixture of 
peptides obtained by proteolysis of the total proteins from S.
cerevisiae cells, 10* cells were grown and harvested. After
lysis, the total proteins were contained in ~9 mL of solution.
A 0.5 mL aliquot was removed for proteolysis with the enzyme
trypsin. From this solution two microliters were directly
injected onto a micro-LC (liquid chromatography) column for MS 
analysis. In a complex mixture of peptides it is conceivable 
that multiple peptide ions may exist at the same m/z and 
contribute to increased background, complicating MS/MS 
analysis and interpretation. To test the ability to obtain 
sequence information by MS/MS from these complex mixtures of 
peptides, ions from the mixture were selected with on-line 
MS/MS analysis. In no case were the spectra contaminated with 
fragment ions from other peptides. A partial list of the 
identified sequences is presented in Table 3. 



Table 3

S. cerevisiae Protein               Seq. ID No.   Amino Acid Sequence

enolase                             3             DPFAEDDWEA\
hypusine containing protein HP2     4             APEGELGDSLQTAFDEGK
phosphoglycerate kinase             5             TGGGASLELLEGK
BMH1 gene product                   6             QAFDDAIAELDTLSEESYK
pyruvate kinase                     7             IPAGWQGLDNGPSER
phosphoglycerate kinase             8             LPGTDVDLPALSEK
hexokinase                          9             IEDDPFENLEDTDDDFQK
enolase                             10            EEALDLIVDAIK
enolase                             11            NPTVEVELTTEK



The MS/MS spectra presented in Table 1 were
interpreted using the described database searching method.
This method serves as a data pre-filter to match MS/MS spectra
to previously determined amino acid sequences. Pre-filtering
the data allows interpretation efforts to be focused on
previously unknown amino acid sequences. Results for some of
the MS/MS spectra are shown in Table 4. No pre-assigning of
sequence ions or manual interpretation is required prior to
the search. However, the sequences must exist in the
database. The algorithm first pre-processed the MS/MS data
and then compared all the amino acid sequences in the database
within ±1 amu of the mass of the precursor ion of the MS/MS
spectrum. The predicted fragmentation patterns of the amino
acid sequences within the mass tolerance were compared to the
experimental spectrum. Once an amino acid sequence was within
this mass tolerance, a final closeness-of-fit measure was
obtained by reconstructing the MS/MS spectra and performing a
correlation analysis to the experimental spectrum. Table 4
lists a number of spectra used to test the efficacy of the
algorithm.

The computer program described above has been 
modified to analyze the MS/MS spectra of phosphorylated 
peptides. In this algorithm all types of phosphorylation are 
considered such as Thr, Ser, and Tyr. Amino acid sequences 
are scanned in the database to find linear stretches of 
sequence that are multiples of 80 amu below the mass of the 
peptide under analysis. In the analysis each putative site of 
phosphorylation is considered and attempts to fit a 
reconstructed MS/MS spectrum to the experimental spectrum are 
made. 
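
The phosphopeptide variant can be sketched as below: a stretch of
sequence is a candidate when its unmodified mass lies a multiple of
80 amu below the mass of the peptide under analysis, and each
combination of Ser, Thr and Tyr sites is then a putative
phosphorylation pattern for which a spectrum would be reconstructed.
The function name, the tolerance and the example masses are
assumptions for illustration.

```python
from itertools import combinations

PHOSPHATE = 80.0  # mass added per phosphorylation, as stated in the text


def phosphorylation_patterns(sequence, unmodified_mass, observed_mass, tolerance=1.0):
    """Return every tuple of site positions consistent with the observed mass."""
    sites = [i for i, aa in enumerate(sequence) if aa in "STY"]
    patterns = []
    for count in range(1, len(sites) + 1):
        if abs(unmodified_mass + count * PHOSPHATE - observed_mass) <= tolerance:
            patterns.extend(combinations(sites, count))
    return patterns


# Hypothetical peptide with an unmodified mass of 500.0.
single = phosphorylation_patterns("ASTKY", 500.0, 580.0)
double = phosphorylation_patterns("ASTKY", 500.0, 660.0)
none = phosphorylation_patterns("ASTKY", 500.0, 500.0)
```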

Table 4

List of results obtained searching genpept and
species specific databases using MS/MS spectra for the
respective peptides.

No.  Mass     Amino Acid Sequence of Peptides   Seq. ID  Genpept   Genpept     Species
              used in the search                No.      Database  Database**  Specific

1    1734.9   DLRSWTAADTAAQISQ                  12       1         1           X
2    1749     DLRSWTAADTAAQITQ                  13       1         1           1
3    1186.5   MATPLLMQALP                       14                             X J
4    1317.7   MATPLLMQALP                       14       61        61          1 *7
5    1571.6   EGVNDNEEGFFSAR^1,2                15       1*        1           X
6    1571.6   EGVNDNEEGFFSAR^1,2                15       1*        1           X
7    1297.5   DRVYIHPFHL (+2)                   16       1                     X
8    1297.5   DRVYIHPFHL (+2)                            2                     *3 <.
9    1297.5   DRVYIHPFHL (+3)                            1         1           1 X
10   1593.8   VEADVAGHGQDILIR^2                 17       1                     X
11   1393.7   HGVTVLTALGAILK^2                  18       1         1           T X
12   1741.8   HSGQAEGYSYTDANIK^2                19       1         1           1
13   648.8    HSGQAEGY (+1)                     20       1         1           1 A.
14   723.9    MAFGGLK^2,3 (+1)                  21
15   636.8    GATLFK (+1) [QATLFG, KTLFK]       22       x*        1*          6
16   524.6    TEFK (+1)                         23       1*        1*          5
17   1251.4   DRNDLLTYLK^1,2                    24       5*        5           1
18   1194.4   VLVLDTDYKK^2                      25       6         6           2
19   700.7    CRGDSY^1 (CGRDSY)                 26       3*        1           1
20   700.7    CRGDSY^1 (+1)                     26                             7
21   764.9    KGATLFK^2                         27       3         3           1
22   1169.3   TGPNLHGLFGR                       28       1         1           1
23   1047.2   DRVYIHPF                          29                             7 1
24   1139.3   TLLVGESATTF (+1)                  30       1         1
25   1189.4   RNVIPDSKY                         31       1         1           1
26   613.7    SSPLPL (+1)                       32       2         4           2
27   1323.5   LARNCQPNYW (C=161.17)             33       1         1           1
28   2496.7   AQSMGFINEDLSTSAQALMSDW            34       1         1           1
29   1551.8   VTLIHPIAMDDGLR                    35       3*        3           1
30   1803.0   GGDTVTLNETDLTQIPK                 36       2         2           1
31   1172.4   VGEEVEIVGIK                       37       1         1           1
32   2148.5   GWQVPAFTLGGEATDIWMR               38       1         1           1
33   2553.9   VASISLPTSCASAGTQCLISGWGNTK^1      39                 1           1
34   1154.3   SSGTSYPDVLK^1                     40                 3           1
35   1174.5   TLNNDIMLIK                        41       1         1           1
36   2274.6   SIVHPSYNSNTLNNDIMLIK^1            42                 2           1

not present in the genpept database
sequence appended to the human database, not originally in human
database
amino acid sequences added to database
not in the top 100 answers
peptide of similar sequence identified



WO 95/25281
PCT/US95/03239

Example #3

Much of the information generated by the genome
projects will be in the form of nucleotide sequences. Those
stretches of nucleotide sequence that can be correlated to a
gene will be translated to a protein sequence and stored in a
specific database (genpept). The un-translated nucleotide
sequences represent a wealth of data that may be relevant to 
protein sequences. The present invention will allow searching 
the nucleotide database in the same manner as the protein 
sequence databases. The procedure will involve the same 
algorithmic approach of cycling through the nucleotide 
sequence. The three-base codon will be converted to a protein 
sequence and the mass of the amino acids summed. To cycle 
through the nucleotide sequence, a one-base increment will be 
used for each cycle. This will allow the determination of an 
amino acid sequence for each of the three reading frames in 
one pass. For example, if an MS/MS spectrum is generated for the
sequence Asp-Leu-Arg-Ser-Trp-Thr-Ala [Seq. ID No. 43]
((M+H)+=848), the algorithm will search the nucleotide sequence
in the following manner.

Nucleotide sequence from the database,
nucleotides  GCG AUC UCC GGU CUU GGA CUG CUC

First pass through the sequence,
nucleotides  GCG AUC UCC GGU CUU GGA CUG CUC
amino acids  Ala Ile Ser Gly Leu Gly Leu Leu

Second pass through the sequence,
nucleotides  G CGA UCU CCG GUC UUG GAC UGC UC
amino acids    Arg Ser Pro Val Leu Gly Leu

Third pass through the sequence,
nucleotides  GC GAU CUC CGG UCU UGG ACU GCU C
amino acids     Asp Leu Arg Ser Trp Thr Ala

Fourth pass through the sequence,
nucleotides  GCG AUC UCC GGU CUU GGA CUG CUC
amino acids      Ile Ser Gly Leu Gly Leu Leu

When the mass of a sequence of amino acids matches the mass of the
peptide, the predicted sequence ions will be compared to the MS/MS
spectrum. From this point on the scoring and reporting 
procedures for the search will be the same as for a protein 
sequence database. 
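
The one-base-per-cycle scan of a nucleotide sequence can be sketched
as below, using the standard genetic code and average residue
masses. The function names, the tolerance and the inclusion of the
mass of water in the accumulated mass are assumptions for
illustration.

```python
from itertools import product

# Standard genetic code, codons ordered U, C, A, G at each position; '*' is a stop.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

AVERAGE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153


def translate_from(rna, offset):
    """Translate complete codons starting at `offset`."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(offset, len(rna) - 2, 3))


def find_peptides(rna, target_mass, tolerance=1.0):
    """Start at every base in turn and sum residue masses codon by codon."""
    hits = []
    for start in range(len(rna) - 2):
        mass = WATER
        for i in range(start, len(rna) - 2, 3):
            residue = CODON_TABLE[rna[i:i + 3]]
            if residue == "*":
                break                      # stop codon ends this stretch
            mass += AVERAGE_MASS[residue]
            if abs(mass - target_mass) <= tolerance:
                hits.append((start, translate_from(rna[:i + 3], start)))
            elif mass > target_mass + tolerance:
                break
    return hits


rna = "GCGAUCUCCGGUCUUGGACUGCUC"
first_frame = translate_from(rna, 0)
third_frame = translate_from(rna, 2)
hits = find_peptides(rna, 847.93)   # average mass of Asp-Leu-Arg-Ser-Trp-Thr-Ala
```

Because the start position advances one base at a time, all three
forward reading frames are covered in a single pass.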

Sequence               Seq. ID No.   Mass
Database nucleotides   44
First pass             45            743
Second pass            46            741
Third pass             43            846
Fourth pass                          672

In light of the above description, a number of
advantages of the present invention can be seen. The present
invention permits correlating mass spectra of a protein,
peptide or oligonucleotide with a nucleotide or protein
sequence database in a fashion which is relatively accurate,
rapid, and which is amenable to automation (i.e., to operation
without the need for the exercise of human judgment). The
present invention can be used to analyze peptides which are 
derived from a mixture of proteins and thus is not limited to 
analysis of intact homogeneous proteins such as those 
generated by specific and known proteolytic cleavage. 

A number of variations and modifications of this 
invention can also be used. The invention can be used in 
connection with a number of different proteins or peptide 
sources and it is believed applicable to any analysis using 
mass spectrometry and proteins. In addition to the examples 
described above, the present invention can be used, for
example, for monitoring fermentation processes by collecting
cells, lysing the cells to obtain the proteins, digesting the
proteins, e.g. in an enzyme reactor, and analyzing by mass
spectrometry as noted above. In this example, the data could
be interpreted using a search of the organism's database 
(e.g., a yeast database). As another example, the invention 
could be used to determine the species of organism from which 
a protein is obtained. The analysis would use a set of 
peptides derived from digestion of the total proteins. Thus, 
the cells from the organism would be lysed, the proteins 
collected and digested. Mass spectrometry data would be 
collected with the most abundant peptides. A collection of 
spectra (e.g., 5 to 10 spectra) would be used to search the 
entire database. The spectra should match known proteins of 
the species. Since this method would use the most abundant
proteins in the cell, it is believed that there is a high
likelihood the proteins of these organisms would already be
sequenced and in the database. In one embodiment, relatively
few cells could be used for the analysis (e.g., on the order
of 10^4 - 10^5).

For example, methods of the invention can be used to 
identify microorganisms, cell surface proteins and the like. 
For identifying microorganisms, the procedure can employ 
tandem mass spectra obtained from peptides produced by 
proteolytic digestion of the cellular proteins. The complex
mixture of peptides produced is subjected to separation by
HPLC on-line to a tandem mass spectrometer. As peptides elute
off the column, tandem mass spectra are obtained by selecting a
peptide ion in the first mass analyzer, sending it into a
collision cell, and recording the mass-to-charge (m/z) ratios
of the resulting fragment ions in the second mass analyzer.
This process is performed over the course of the HPLC analysis
and produces a large collection of spectra (e.g., from 10 to
200 or more). Each spectrum represents a peptide derived from the
microorganism's protein (gene) pool and thus the collection
can be used to develop one or more family, genus, species, 
serotype or strain-specific markers of the microorganism, as 
desired. 

The identification of the microorganism is performed 
using one of at least three software related techniques. In a 
first technique, a database search, the tandem mass spectra 
are used to search protein and nucleotide databases to 
identify an amino acid sequence which is represented by the 
spectrum. Identification of the organism is achieved when a 
preponderance of spectra obtained in the mass spectrometry 
analysis match to proteins previously identified as coming 
from a particular organism. Means for searching databases in 
this fashion are as described hereinabove. 
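
The "preponderance of spectra" criterion can be sketched as a simple
vote over the organism assigned to each spectrum. The threshold of
more than half of the spectra is an assumption, since the text does
not quantify a preponderance, and the function name is hypothetical.

```python
from collections import Counter


def identify_organism(matches, min_fraction=0.5):
    """matches: one organism name per spectrum, or None when nothing matched.
    Returns the organism matched by more than `min_fraction` of the spectra."""
    counts = Counter(name for name in matches if name is not None)
    if not counts:
        return None
    organism, hits = counts.most_common(1)[0]
    return organism if hits / len(matches) > min_fraction else None


# Hypothetical results for ten spectra.
matches = ["S. cerevisiae"] * 6 + ["E. coli"] * 2 + [None] * 2
organism = identify_organism(matches)
undecided = identify_organism(["S. cerevisiae", "E. coli", None, None])
```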

In a second technique a library search can be 
performed, such as if no solid matches are observed using the 
database search described above. In this approach the data 
set is compared to a pre-defined library of spectra obtained 
from known organisms. Thus, initially a library of peptide 
spectra is created from known microorganisms. The library of 
tandem mass spectra for micro-organisms can be constructed by 
any of several methods which employ LC-MS/MS. The methods can
vary in the location from which cellular proteins are obtained
and in the amount of pre-purification employed for the
resulting peptide mixture prior to LC-MS/MS analysis. For
example, intact cells can be treated with a proteolytic enzyme
such as trypsin, chymotrypsin, endoproteinase Glu-C,
endoproteinase Lys-C, pepsin, etc. to digest the proteins
exposed on the cell surface. Pre-treatment of the intact
cells with one or more glycosidases can be used to remove
steric interference that may be created by the presence of
carbohydrates on the cell surface. Thus, the pre-treatment
with glycosidases may be used to obtain higher peptide yields
during the proteolysis step. A second method to prepare
peptides involves rupturing the cell membranes (e.g., by
sonication, hypo-osmotic shock, freeze-thawing, glass beads,
etc.) and collecting the total proteins by precipitation,
e.g., using acetone or the like. The proteins are resuspended
in a digestion buffer and treated with a protease such as
trypsin, chymotrypsin, endoproteinase Glu-C, endoproteinase
Lys-C, etc. to create a mixture of peptides. Partial
simplification of this mixture of peptides, such as by
partitioning the mixture into acidic and basic fractions or by
separation using strong cation exchange chromatography, leads 
to several pools of peptides which can then be used in the 
mass spectrometry process. The peptide mixtures are analyzed 
by LC-MS/MS, creating a large set of spectra, each 
representing a unique peptide marker of the organism or cell 
type. 

The data are stored in the library in any of a 
variety of means, but conveniently in three sections, wherein 
one section is the peptide mass determined from the spectrum, 
a second section is information related to the organism, 
species, growth conditions, etc., and a third section contains 
the mass/intensity data. The data can be stored in a variety 
of formats, conveniently an ASCII format or in a binary 
format. 
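
A library entry with the three sections described above could be
represented as in the sketch below. The line-oriented ASCII layout
(a MASS line, INFO lines, then one m/z and intensity pair per line)
is a hypothetical format chosen for illustration; the text does not
specify one.

```python
from dataclasses import dataclass


@dataclass
class LibraryEntry:
    peptide_mass: float   # section 1: peptide mass determined from the spectrum
    annotation: dict      # section 2: organism, species, growth conditions, etc.
    peaks: list           # section 3: (m/z, intensity) pairs

    def to_ascii(self):
        lines = [f"MASS {self.peptide_mass}"]
        lines += [f"INFO {key}={value}" for key, value in self.annotation.items()]
        lines += [f"{mz} {intensity}" for mz, intensity in self.peaks]
        return "\n".join(lines)

    @classmethod
    def from_ascii(cls, text):
        mass, annotation, peaks = 0.0, {}, []
        for line in text.splitlines():
            if line.startswith("MASS "):
                mass = float(line[5:])
            elif line.startswith("INFO "):
                key, value = line[5:].split("=", 1)
                annotation[key] = value
            elif line.strip():
                mz, intensity = line.split()
                peaks.append((float(mz), float(intensity)))
        return cls(mass, annotation, peaks)


# Hypothetical entry.
entry = LibraryEntry(
    1186.5,
    {"organism": "S. cerevisiae", "growth": "log phase"},
    [(147.1, 12.0), (256.2, 80.0)],
)
restored = LibraryEntry.from_ascii(entry.to_ascii())
```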

To perform the library search, spectra are compared
by first determining whether the mass of the peptide is within
a preset mass tolerance (typically about ± 1-3 amu) of the
library spectrum; a cross-correlation function as described
hereinabove is then used to obtain a quantitative value of the
similarity or closeness-of-fit of the two spectra. The
process is similar to the database searching algorithm except 
a spectrum is not reconstructed for the amino acid sequence. 




To provide a set of comparison spectra, the tandem mass
spectrum can be used to search a small (e.g., ~100 protein
sequences) randomly generated sequence database. This
provides a background against which similarity is compared
and which is used
to generate a normalized score. 
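
The library search can be sketched as a mass pre-filter followed by
a similarity measure, with the result expressed relative to
background scores. Two substitutions are made here and should be
read as assumptions: a normalized dot product stands in for the
cross-correlation function described hereinabove, and the normalized
score is computed as the number of standard deviations above the
mean of the background scores. All names are hypothetical.

```python
def similarity(a, b):
    """Normalized dot product of two spectra (unit m/z -> intensity), from 0 to 1."""
    shared = sum(intensity * b.get(mz, 0.0) for mz, intensity in a.items())
    norm = (sum(i * i for i in a.values()) * sum(i * i for i in b.values())) ** 0.5
    return shared / norm if norm else 0.0


def library_search(query_mass, query, library, tolerance=2.0):
    """library: iterable of (name, peptide mass, spectrum). Only entries within
    the mass tolerance are compared; results are sorted best first."""
    hits = [
        (similarity(query, spectrum), name)
        for name, mass, spectrum in library
        if abs(mass - query_mass) <= tolerance
    ]
    return sorted(hits, reverse=True)


def normalized_score(score, background):
    """Express a score in standard deviations above the background mean."""
    mean = sum(background) / len(background)
    variance = sum((s - mean) ** 2 for s in background) / len(background)
    return (score - mean) / variance ** 0.5 if variance else 0.0


# Hypothetical library and query.
library = [
    ("a", 1000.0, {100: 1.0, 200: 1.0}),
    ("b", 1001.0, {100: 1.0, 300: 1.0}),
    ("c", 1200.0, {100: 1.0, 200: 1.0}),
]
hits = library_search(1000.5, {100: 1.0, 200: 1.0}, library)
z = normalized_score(hits[0][0], [0.1, 0.2, 0.3])
```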

A third related technique for identifying a 
microorganism or cell involves de novo interpretation to 
determine a set of amino acid sequences that have the same 
mass as the peptide represented by the spectrum. The set of 
amino acid sequences is limited by using the spectral pre- 
processing equation 1, above, to rank the sequences. This set 
of amino acid sequences then serves as the database for use in 
the searching method described hereinabove. An amino acid 
sequence is thereby derived for a tandem mass spectrum that is 
not contained in the organized databases. By using 
phylogenetic analysis of the determined amino acid sequences 
they can be placed within a species, genus or family and a 
classification of the microorganism is thereby accomplished. 

The methodology described above has applications in 
addition to identifying microorganisms. For example, cDNA 
sequencing can be carried out using conventional means to 
obtain partial sequences of genes expressed in particular cell 
lines, tissue types or microorganisms. This information then 
serves as the database for the subsequent analyses. The 
approach described above for digesting proteins exposed on
the cell surface by enzymatic digestion can be used to 
generate a collection of peptides for LC-MS/MS analysis. The 
resulting spectra are used to search the nucleotide sequences 
in all 6 reading frames to match amino acid sequences to the 
MS/MS spectra. The amino acid sequences identified represent 
regions of the cell surface proteins exposed to the 
extracellular space. This method provides at least two 
additional pieces of information not directly obtainable from 
cDNA sequencing. First, the spectra identify the proteins 
residing on the membrane of the cells. Secondly, sidedness
information is obtained about the folding of the proteins on
the cell surface. The peptide sequences matched to the
nucleotide sequence information identify those segments of
the protein sequence exposed extracellularly.

The methods can also be used to interpret the MS/MS 
spectra of carbohydrates. In this method the carbohydrate(s)
of interest is subjected to separation by HPLC on-line to a 
tandem mass spectrometer as with the peptides. The 
carbohydrates can be obtained from a complex mixture of 
carbohydrates or obtained from proteins, cells, etc. by 
chemical or enzymatic release. Tandem mass spectra are 
obtained by selecting a carbohydrate ion in the first mass 
analyzer, sending it into a collision cell, and recording the 
mass-to-charge (m/z) ratios of the resulting fragment ions in 
the second mass analyzer. This process is performed over the 
course of the HPLC analysis and produces a large collection of 
spectra (e.g., from 10 to 200 or more). The fragmentation
patterns of the carbohydrate structures contained in the 
database can be predicted and a theoretical representation of 
the spectra can be compared to the pattern in the tandem mass 
spectrum by using the method described hereinabove. The 
carbohydrate structures analyzed by tandem mass spectrometry 
can thereby be identified. These methods can thus be used for 
characterization of the carbohydrate structures found on 
proteins, cell surfaces, etc. 

The present invention can be used in connection with 
diagnostic applications, such as described above and in 
Example 2. Another example involves identifying virally
infected cells. Success of such an approach is believed to 
depend on the relative abundance of the viral proteins versus 
the cellular proteins, at least using present equipment. If 
an antibody were produced to a specific region of a protein 
common to certain pathogens, the mixture of proteins could be 
partially fractionated by passing the material over an 
immunoaffinity column. Bound proteins are eluted and
digested. Mass spectrometry generates the data to search a 
database. One important element is finding a general handle 
to pull proteins from the cell. This approach could also be 
used to analyze specific diagnostic proteins. For example, if
a certain protein variant is known to be present in some form
of cancer or genetic disease, an antibody could be produced to
a region of the protein that does not change. An
immunoaffinity column could be constructed with the antibody
to isolate the protein away from all the other cellular
proteins. The protein would be digested and analyzed by
tandem mass spectrometry. A database of all the possible
mutations in the protein could be maintained and the
experimental data analyzed against this database. 

One possible example would be cystic fibrosis. This 
disease is characterized by multiple mutations in the CFTR
protein. One mutation is responsible for about 70% of the
cases and the other 30% of the cases result from a wide 
variety of mutations. To analyze these mutations by genetic 
testing would require many different analyses and probes. In 
the assay described above, the protein would be isolated and 
analyzed by tandem mass spectrometry. All the mutations in 
the protein could be identified in an assay based on 
structural information. The database used would preferably 
describe all the known mutations. Implementation of this 
approach is believed to involve significant difficulties. The 
amount of protein required could be so large that it would be 
impractical to obtain from a patient. This problem may be 
overcome as the sensitivity of mass spectrometry improves in 
the future. In addition, CFTR is a transmembrane protein, 
and such proteins are typically very difficult to manipulate 
and digest. The approach described could be used for any 
diagnostic protein. The data would be highly specific and the 
data analysis essentially automated. 
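
By way of illustration only, a database of known variants of a diagnostic protein might be generated from a reference sequence and a list of substitutions, as in the following minimal sketch. The function names, the label format, and the toy sequence are hypothetical and do not come from the described embodiments; only single-residue substitutions are handled, and the data shown are not real CFTR information.

```python
from typing import Dict, Iterable, Tuple


def apply_substitution(reference: str, position: int, new_residue: str) -> str:
    """Return `reference` with the residue at 1-based `position` replaced."""
    if not 1 <= position <= len(reference):
        raise ValueError(f"position {position} is outside the sequence")
    index = position - 1
    return reference[:index] + new_residue + reference[index + 1:]


def build_variant_database(
    reference: str, substitutions: Iterable[Tuple[int, str]]
) -> Dict[str, str]:
    """Map a label such as 'A2V' to the variant sequence it describes."""
    database = {"reference": reference}
    for position, new_residue in substitutions:
        # Validate and build the variant first, then derive its label.
        variant = apply_substitution(reference, position, new_residue)
        label = f"{reference[position - 1]}{position}{new_residue}"
        database[label] = variant
    return database


# Hypothetical toy data, not real CFTR sequence or mutation information.
variants = build_variant_database("GASPVTLK", [(2, "V"), (5, "L")])
```

Each entry of such a database could then be searched against the experimental fragment spectrum in the same way as any other database sequence.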

It is believed that the present invention can be 
used with any size peptide. The process requires that 
peptides be fragmented and there are methods for achieving 
fragmentation of very large proteins. Some such techniques 
are described in Smith et al., "Collisional Activation and 
Collision-Activated Dissociation of Large Multiply Charged 
Polypeptides and Proteins Produced by Electrospray Ionization" 
J. Amer. Soc. Mass Spect. 1: 53-65 (1990). The present method 
can be used to analyze data derived from intact proteins, in 
that there is no theoretical or absolute practical limit to 
the size of peptides that can be analyzed according to this 
invention. Analysis using the present invention has been 
performed on peptides at least in the size range from about 
400 amu (4 residues) to about 2500 amu (26 residues). 

In the described embodiments, candidate sub-sequences are 
identified and fragment spectra are predicted as they are 
needed, at the time of doing the analysis. If sufficient 
computational resources and storage facilities are available 
to perform in advance some or all of the calculations needed 
for candidate sequence identification (such as calculation of 
sub-sequence masses) and/or spectra prediction (such as 
calculation of fragment masses), storage of these items in a 
database can be employed so that some or all of these items 
can be looked up rather than calculated each time they are 
needed. 
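
The look-up alternative described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the described embodiments: an in-memory dictionary stands in for the database, the function names are hypothetical, and commonly tabulated monoisotopic residue masses are used for a handful of amino acids only.

```python
from typing import Dict, Iterable

# Monoisotopic residue masses in daltons for a few amino acids (assumed
# values from standard tables; a full table would cover all residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
}
WATER = 18.01056  # added once per peptide for the free termini


def peptide_mass(sub_sequence: str) -> float:
    """Calculate the neutral mass of a sub-sequence from its residues."""
    return sum(RESIDUE_MASS[r] for r in sub_sequence) + WATER


def build_mass_table(sequences: Iterable[str], max_len: int) -> Dict[str, float]:
    """Precompute masses of every sub-sequence of up to `max_len` residues."""
    table: Dict[str, float] = {}
    for seq in sequences:
        for start in range(len(seq)):
            last = min(start + max_len, len(seq))
            for end in range(start + 1, last + 1):
                sub = seq[start:end]
                if sub not in table:
                    table[sub] = peptide_mass(sub)
    return table


def lookup_mass(table: Dict[str, float], sub_sequence: str) -> float:
    """Look the mass up if stored; otherwise calculate it when needed."""
    stored = table.get(sub_sequence)
    return stored if stored is not None else peptide_mass(sub_sequence)


table = build_mass_table(["GASP"], max_len=2)
```

The same pattern would apply to predicted fragment masses; the trade-off is storage space against repeated calculation.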

While the present invention has been described by 
way of the preferred embodiment and certain variations and 
modifications, other variations and modifications of the 
present invention can also be used, the invention being 
described by the following claims. 

WHAT IS CLAIMED IS: 

1. A method for correlating a peptide fragment 
mass spectrum with amino acid sequences derived from a 
database of sequences, comprising: 

storing data representing a first mass spectrum of a 
plurality of fragments of at least a first peptide; 

calculating a plurality of predicted mass spectra of 
at least a portion of a plurality of said sequences in said 
database of sequences; and 

calculating at least a first measure for each of 
said plurality of predicted mass spectra, said first measure 
being an indication of the closeness-of-fit between said first 
mass spectrum and each of said plurality of mass spectra. 

2. A method, as claimed in claim 1, wherein said 
first mass spectrum is provided from a tandem mass 
spectrometer device. 

3. A method, as claimed in claim 2, wherein the 
tandem mass spectrometer is one of a triple quadrupole mass 
spectrometer, a Fourier-transform cyclotron resonance mass 
spectrometer, a tandem time-of-flight mass spectrometer and a 
quadrupole ion trap mass spectrometer. 

4. A method, as claimed in claim 1, wherein said 
database of sequences is a database of amino acid sequences of 
a plurality of proteins. 

5. A method, as claimed in claim 1, wherein said 
database of sequences is a nucleotide database. 

6. A method, as claimed in claim 1, further 
comprising selecting a first plurality of sub-sequences from 
said database of sequences, wherein said step of calculating a 
plurality of predicted mass spectra includes calculating at 
least one predicted mass spectrum for each of said selected 
first plurality of sub-sequences. 

7. A method, as claimed in claim 1, wherein said 
step of calculating a first measure includes selecting those 
values from said first mass spectrum having an intensity 
greater than a predetermined threshold. 

8. A method, as claimed in claim 1, further 
comprising normalizing said first spectrum prior to said step 
of calculating at least a first measure. 

9. A method, as claimed in claim 1, wherein said 
step of calculating a plurality of predicted mass spectra 
includes calculating predicted mass spectra for only a portion 
of said sequence database. 

10. A method, as claimed in claim 9, wherein said 
first peptide is derived from a protein which is obtained from 
a first organism and wherein said portion of said sequence 
database is the portion containing sequences for proteins 
found in said first organism. 

11. A method, as claimed in claim 2, wherein a first 
mass spectrometer of said tandem mass spectrometer device is 
used to separate out a component having a first mass, an 
activation device of said mass spectrometer device is used to 
fragment said first component and a second mass spectrometer 
of said tandem mass spectrometer device is used to provide said 
first mass spectrum. 

12. A method, as claimed in claim 1, wherein said 
first peptide is isolated by chromatography. 

13. A method, as claimed in claim 1, wherein said 
data representing said first mass spectrum includes a 
plurality of mass-charge pairs. 

14. A method, as claimed in claim 1, wherein said 
step of calculating predicted mass spectra comprises: 

deriving a plurality of masses from portions of said 
plurality of sequences, each mass equal to the mass of a 
peptide fragment which corresponds to a portion of a sequence 
in said plurality of sequences; 

selecting those masses, among said plurality of 
masses, which are within a predetermined mass tolerance of the 
mass of said first peptide and storing an indication of which 
portion of which sequence each of said selected masses 
corresponds to, to provide a plurality of candidate sequence 
portions; and 

calculating a plurality of mass-charge pairs for 
each of said candidate sequence portions, each of said 
mass-charge pairs having a mass substantially equal to the mass of 
a peptide fragment corresponding to a portion of one of said 
candidate sequence portions. 
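
For illustration only, the steps recited in claim 14 might be sketched as below. This is not the claimed implementation: the function names are hypothetical, the residue masses are commonly tabulated monoisotopic values for a few amino acids, and singly charged b- and y-type ions are assumed as the predicted fragment types.

```python
from typing import List, Sequence, Tuple

# Monoisotopic residue masses in daltons (assumed standard-table values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
}
WATER = 18.01056
PROTON = 1.00728


def peptide_mass(portion: str) -> float:
    return sum(RESIDUE_MASS[r] for r in portion) + WATER


def candidate_portions(
    sequences: Sequence[str], target_mass: float, tolerance: float
) -> List[Tuple[int, int, int]]:
    """Return (sequence index, start, end) for each portion whose mass is
    within `tolerance` of `target_mass`; `end` is exclusive."""
    hits = []
    for idx, seq in enumerate(sequences):
        for start in range(len(seq)):
            running = WATER
            for end in range(start, len(seq)):
                running += RESIDUE_MASS[seq[end]]
                if running > target_mass + tolerance:
                    break  # masses only grow, so longer portions cannot match
                if abs(running - target_mass) <= tolerance:
                    hits.append((idx, start, end + 1))
    return hits


def predicted_fragments(portion: str) -> List[float]:
    """Predict m/z values of singly charged b- and y-type ions (assumed)."""
    mz = []
    for i in range(1, len(portion)):
        b_ion = sum(RESIDUE_MASS[r] for r in portion[:i]) + PROTON
        y_ion = sum(RESIDUE_MASS[r] for r in portion[i:]) + WATER + PROTON
        mz.extend([b_ion, y_ion])
    return sorted(mz)


# Hypothetical toy database with one sequence.
database = ["GASPVTLK"]
hits = candidate_portions(database, peptide_mass("SPV"), tolerance=0.01)
fragments = predicted_fragments("SPV")
```

Storing the index and offsets, rather than the sub-sequence text, corresponds to storing "an indication of which portion of which sequence" each selected mass belongs to.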

15. A method, as claimed in claim 1, wherein said 
first measure comprises a correlation coefficient. 

16. A method, as claimed in claim 1, wherein said 
step of calculating a first measure comprises: 

calculating a preliminary score for each of said 
plurality of candidate sequence portions; 

identifying a plurality of primary candidate 
portions which have a preliminary score which is greater than 
at least one candidate sequence which is not identified as a 
primary candidate portion; and 

calculating a correlation coefficient for each of 
said primary candidate portions. 

17. A method, as claimed in claim 8, wherein each 
of said plurality of mass spectra and said first mass spectrum 
includes a plurality of mass-charge pairs, each mass-charge 
pair having an intensity value, and further comprising the 
step of identifying, for each of said plurality of mass 
spectra, a set of matched fragments which have less than a 
predetermined difference from corresponding fragments in said 
first mass spectrum; and 

wherein said preliminary score is the number of 
fragments of a predicted mass spectrum in said set of matched 
fragments multiplied by the sum of the intensity values for 
the mass-charge pairs corresponding to said matched fragments. 
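
The preliminary score recited above, the number of matched fragments multiplied by the sum of their intensities, can be illustrated with the following minimal sketch. The function name and data layout are hypothetical, and two choices are assumptions rather than claim language: the intensities summed are those of the observed peaks, and each predicted fragment is paired with its closest observed peak.

```python
from typing import List, Optional, Sequence, Tuple


def preliminary_score(
    predicted_mz: Sequence[float],
    observed: Sequence[Tuple[float, float]],
    max_difference: float,
) -> float:
    """Score = (number of matched fragments) x (sum of matched intensities).

    `observed` holds (m/z, intensity) pairs of the measured spectrum.
    A predicted fragment is matched when an observed peak differs from it
    by less than `max_difference`.
    """
    matched_intensities: List[float] = []
    for mz in predicted_mz:
        best: Optional[Tuple[float, float]] = None
        for obs_mz, intensity in observed:
            diff = abs(obs_mz - mz)
            if diff < max_difference and (best is None or diff < best[0]):
                best = (diff, intensity)
        if best is not None:
            matched_intensities.append(best[1])
    return len(matched_intensities) * sum(matched_intensities)


# Hypothetical values: two of the three predicted fragments are matched.
score = preliminary_score(
    predicted_mz=[100.0, 200.0, 300.0],
    observed=[(100.2, 50.0), (200.9, 30.0), (300.1, 20.0)],
    max_difference=0.5,
)
```

In this example the matched intensities are 50.0 and 20.0, so the score is 2 x 70.0.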

18. A method for interpreting the mass spectrum of 
an oligonucleotide comprising: 

providing a library of nucleotide sequences; 

storing, in a database, a plurality of nucleotide 
sub-sequences from said library, said plurality including all 
sequences smaller than n-mers; 

storing data representing a first mass spectrum of a 
plurality of fragments of said oligonucleotide; 

calculating predicted mass spectra for each of said 
plurality of nucleotide sub-sequences; and 

calculating at least a first closeness-of-fit 
measure for each of said predicted mass spectra, with respect 
to said first mass spectrum. 

19. A method, as claimed in claim 18, wherein n is 10. 

20. A method for determining whether a peptide in a 
mixture of proteins is homologous to a portion of any of a 
plurality of proteins specified by an amino acid sequence in a 
database of sequences, comprising: 

using a tandem mass spectrometer to receive a 
plurality of peptides obtained from said mixture of proteins, 
to select at least a first peptide from said mixture of 
peptides, to fragment said first peptide and to generate a 
peptide fragment mass spectrum; 

storing data representing said first mass spectrum; 

and 




correlating said mass spectrum with an amino acid 
sequence in said database of sequences, to determine the 
correspondence of a protein specified in said sequence 
database with a protein in said mixture of proteins. 



21. A method, as claimed in claim 20, wherein said 
step of correlating includes predicting at least one mass 
spectrum from said amino acid sequence. 



22. A method according to claim 20, wherein the 
mixture of proteins is obtained from a cell or microorganism 
to be identified. 



23. A method according to claim 22, wherein the 
mixture of proteins is obtained by proteolytic digestion of 
cellular proteins. 



24. The method of claim 23, wherein the cellular 
proteins are extracellular. 



25. A method for identifying an organism of 
interest by determining whether a mass spectrum or a plurality 
of mass spectra of peptides obtained from the organism or 
components thereof to be identified is contained in a library 
of spectra of known organisms, comprising: 

using a tandem mass spectrometer to receive a 
plurality of peptides obtained from a mixture of proteins 
obtained from said organism to be identified, to select at 
least a first peptide from said plurality of peptides, to 
fragment said first peptide and to generate a peptide fragment 
mass spectrum; 

storing data representing said first mass spectrum; 

and 

correlating said mass spectrum with a mass spectrum 
in said library of spectra of known organisms to determine the 
correspondence of said spectra with the spectra obtained from 
peptides obtained from the organism to be identified, thereby 
identifying said organism. 

26. The method of claim 25, wherein the organism to 
be identified is a bacterium, fungus or virus. 

27. The method according to claim 25, wherein the 
mixture of proteins is obtained by enzymatic digestion of the 
organism's proteins. 

28. A method for characterizing a carbohydrate 
structure of interest from a mixture of carbohydrates, 
comprising: 

using a tandem mass spectrometer to receive a 
plurality of carbohydrates obtained from the mixture of 
carbohydrates, to select at least a first carbohydrate ion 
from the mixture of carbohydrates in a first mass analyzer of 
the tandem mass spectrometer, to fragment said first 
carbohydrate and to generate a carbohydrate fragment mass 
spectrum; 

storing data representing said first mass spectrum; 

and 

correlating said mass spectrum with a database of 
spectra of known carbohydrates, to determine the 
correspondence of a carbohydrate specified in said 
carbohydrate database with a carbohydrate in said mixture of 
carbohydrates, thereby characterizing the structure of the 
carbohydrate of interest. 

29. The method of claim 28, wherein the mixture of 
carbohydrates is obtained from a glycosylated protein of 
interest. 

30. The method of claim 29, wherein the mixture of 
carbohydrates is obtained from a glycosylated protein of 
interest by chemical or enzymatic release from the protein. 